-
Notifications
You must be signed in to change notification settings - Fork 428
Defer MonitorUpdatingPersister writes to flush() #4317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
👋 Hi! I see this is a draft PR. |
79e9390 to
c8405e2
Compare
| use core::mem; | ||
| use core::ops::Deref; | ||
| use core::pin::{pin, Pin}; | ||
| use core::pin::pin; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be able to do this without touching persist.rs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where would we do it then? Queue in ChainMonitor, in KVStore, or somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ChainMonitor would need to store the actual monitor data to defer writes, not just track update IDs as it does now. This means either cloning expensive ChannelMonitor objects or storing serialized bytes, which leaks persistence format details into ChainMonitor?
c8405e2 to
40e909a
Compare
93ff6c9 to
181a6e0
Compare
|
ldk-node's channel_full_cycle test now passes with deferred monitor writes. This validates that channel open, bidirectional payments, and close all work correctly when persist methods return InProgress and actual writes happen on flush(). The critical piece is calling channel_monitor_updated after each flush to unblock the channel - without this, channels would stay paused forever waiting for persistence confirmation. Not yet tested: node restart scenarios where the manager/monitor ordering invariant matters, which is the primary motivation for this change. |
181a6e0 to
a63bd21
Compare
Update MonitorUpdatingPersister and MonitorUpdatingPersisterAsync to queue persist operations in memory instead of writing immediately to disk. The Persist trait methods now return ChannelMonitorUpdateStatus:: InProgress and the actual writes happen when flush() is called. This fixes a race condition that could cause channel force closures: previously, if the node crashed after writing channel monitors but before writing the channel manager, the monitors would be ahead of the manager on restart. By deferring monitor writes until after the channel manager is persisted (via flush()), we ensure the manager is always at least as up-to-date as the monitors. The flush() method takes a count parameter specifying how many queued writes to flush. The background processor captures the queue size before persisting the channel manager, then flushes exactly that many writes afterward. This prevents flushing monitor updates that arrived after the manager state was captured. Key changes: - Add PendingWrite enum with FullMonitor and Update variants for queued writes - Add pending_writes queue to MonitorUpdatingPersisterAsyncInner - Add pending_write_count() and flush(count) to Persist trait and ChainMonitor - ChainMonitor::flush() calls channel_monitor_updated for each completed write - Stale update cleanup happens in flush() after full monitor is written - Call flush() in background processor after channel manager persistence Co-Authored-By: Claude Opus 4.5 <[email protected]>
a63bd21 to
6ed79d9
Compare
|
The ldk-node payment benchmark (1000 payments, 10 iterations) shows a ~14% slowdown with deferred writes: main: 9.65s median This is expected since we now queue writes in memory and flush them after channel manager persistence, adding overhead to the persist cycle. The tradeoff is correctness - ensuring monitors are never ahead of the manager on disk after a crash. |
Update MonitorUpdatingPersister and MonitorUpdatingPersisterAsync to queue persist operations in memory instead of writing immediately to disk. The Persist trait methods now return ChannelMonitorUpdateStatus:: InProgress and the actual writes happen when flush() is called.
This fixes a race condition that could cause channel force closures: previously, if the node crashed after writing channel monitors but before writing the channel manager, the monitors would be ahead of the manager on restart. By deferring monitor writes until after the channel manager is persisted (via flush()), we ensure the manager is always at least as up-to-date as the monitors.
Key changes: